Introduction. Coronavirus disease 2019 (COVID-19) was first detected in Wuhan, China, in December 2019. Coronaviruses are a 
family of viruses that cause illnesses ranging from the common cold to severe acute respiratory syndrome (SARS). The symptoms of 
COVID-19 are similar to those of a cold or seasonal allergies. Like other respiratory viruses, it is mainly transmitted through 
airborne droplets when coughing or sneezing. Therefore, the recognition of COVID-19 requires careful laboratory 
analysis, and the reduction of recognition resources is a major challenge. On 11 March, 2020, the World Health 
Organization (WHO) declared COVID-19, caused by SARS-CoV-2, a pandemic, as there had been an exponential 
increase in cases worldwide, and demand for intensive beds and related structures had far exceeded existing capacity. 
Regions of Italy were among the first examples of this. Brazil registered its first case of SARS-CoV-2 on 02/26/2020. 
Transmission of the virus in this country shifted very quickly from imported cases to local and, finally, community 
transmission, with the Brazilian federal government announcing national community transmission on 03/20/2020. As of 
March 23, in the state of Sao Paulo with a population of about 12 million people, where the Israelita Albert Einstein 
Hospital is located, 477 cases of the disease and 30 related deaths were registered, and on March 27, there were already 
1223 cases of COVID-19 with 68 concomitant deaths. To slow the spread of the virus in the state of Sao Paulo, 
quarantines and social distancing measures were introduced. One of the motivations for this challenge is the fact that, in 
the context of an overwhelmed healthcare system with possibly limited capacity for SARS-CoV-2 testing, it is not practical to 
test every case, and testing may need to be limited to a target subpopulation. The study objective is to build a 
machine learning model that can predict the result of a SARS-CoV-2 test from medical data. For this, various 
machine learning classification models are compared, and the best one for predicting coronavirus infection is determined. The 
comparison is based on individuals in class 1, i.e., those with a positive test. Therefore, it is required to determine the 
machine learning model with the best recall and F1 score for class 1. 

Materials and Methods. An open-source data set from the Israelita Albert Einstein Hospital in Sao Paulo, Brazil, was 
taken as a basis. The following machine learning models were used for the study: Random Forest (RF), K-Nearest 
Neighbor (KNN), Support Vector Machine (SVM), Logistic Regression (LR), Decision Tree (DT) and AdaBoost (AB), 
together with the 10-fold cross-validation technique. Several machine learning performance measures, namely accuracy, 
recall, and F1 score, were evaluated. 

Results. Out of a total of 5,644 people tested during the COVID-19 pandemic, 5,086 people tested negative and 
558 people tested positive. The support vector machine showed the best results in detecting 
coronavirus, with a recall of 75 % and an F1 score of 60 %, compared with the other models: Random Forest, KNN, LR, AB, and DT. 

Discussion and Conclusions. It was found that when using the AB algorithm, greater accuracy is achieved, but the 
stability of the LSVM algorithm is higher. Therefore, it can be recommended as a useful tool for detecting COVID-19. 

1. Introduction 

COVID-19 is a severe acute respiratory syndrome caused by the SARS-CoV-2 virus. This virus, 
which can infect humans or animals, was discovered in the city of Wuhan, in the province of 
Hubei, China, during the pneumonia epidemic of January 2020 [1,2]. It is the seventh known human coronavirus. To 
everyone's surprise, this virus spread worldwide, causing 318,599 deaths and 4,806,299 infections [3]. 

SARS-CoV-2, SARS-CoV and MERS-CoV (Middle East Respiratory Syndrome Coronavirus) cause severe 
pneumonia with a mortality rate of 2.9 %, 9.6 % and 36 % respectively [4-6]. 

The other four viruses, namely OC43, NL63, HKU1, and 229E, are responsible for illnesses with mild 
symptoms [7]. 

It should be noted that since the COVID-19 epidemic began, there has been much speculation about the origin of this 
virus [8]. Some said that it was the result of work done in a laboratory. However, after studies conducted on genetic 
data, this hypothesis was dismissed [9]. Analysis and comparison with the genomes of previously known coronaviruses 
clearly show that SARS-CoV-2 is different from other coronaviruses [8, 11]. The virus responsible for COVID-19 
(SARS-CoV-2) is similar to the SARS-like virus of bats [2]. Thus, the COVID-19 virus is believed to have originated from a 
bat coronavirus that became infectious to humans while acquiring genes specific to pangolin coronaviruses. It should be 
noted that the actual origin of COVID-19 is still unclear. 

The symptoms of COVID-19 are similar to those of seasonal flu. The disease is more severe in the elderly and 
in people with certain chronic diseases. Patients with COVID-19 can have symptoms ranging from mild 
to severe. The most common symptoms are fever (83 %), cough (82 %) and breathlessness (31 %) [12]. In patients with 
pneumonia, the X-ray of the lungs shows multiple mottling and ground-glass opacity [12, 13]. 

Gastrointestinal symptoms in patients with COVID-19 include vomiting, diarrhoea, and abdominal 
pain [12, 14]. 

A decrease in lymphocytes and eosinophils, lower haemoglobin levels, and an increase in white 
blood cells and neutrophils are also observed [15-18]. 

The manifestation of COVID-19 in children is different from that in adults. In children, the symptoms are usually mild. 
However, severe and fatal cases have been reported in some children [19-27]. 

Like other respiratory viruses, COVID-19 is transmitted mainly by the respiratory route. Among the routes of 
transmission, droplet transmission is the most widespread [28, 29]. Other transmission routes exist, 
namely the faecal route and saliva. Indeed, SARS-CoV-2 RNA was found in the stool of a patient with COVID-19 [31]. 
SARS-CoV-2 RNA can also be detected on inanimate surfaces (e.g., door handles). People who have been in contact with these 
surfaces could be contaminated [29]. 

The model proposed in this paper will make it possible to identify positive and negative cases in the dataset studied and the 
factors associated with COVID-19. The proposed prediction model tracks the results of this 
epidemic situation so that the huge economic losses, community spread, and the extent of social 
distancing can be assessed and precise decisions made accordingly. This method will allow government 
authorities to put in place preventive measures, and our future work will build on it to predict the onset of this disease. 

2. Data Resources and Methods 

The dataset used was uploaded to Kaggle. It is open source and available at 
kaggle.com/einsteindata4u/covid19. This dataset contains data from patients at the Israelita Albert Einstein Hospital 
in Sao Paulo, Brazil, anonymized in accordance with international best practices and recommendations. This section 
describes the proposed approach and gives a detailed overview of the tasks. These tasks can help to understand and extract 
knowledge from COVID-19 data, which can help countries contain the spread of the virus, raise awareness, launch 
initiatives, determine whether mitigation has a positive effect, identify other factors affecting the virus, etc. This will 
allow countries to prepare for what may happen in the near future, which could help save lives and alleviate suffering. 
The epidemiological information includes various characteristics of each case studied, including case identification, age, sex, 
target value, lymphocytes, leukocytes, monocytes, HCO3, etc. 

2.1. Data Pre-processing 

In data analysis, the most important step is pre-processing. However, it is not clear which pre-processing 
methods the author used. This part must be completed. 

2.2. Data Transformation 

The data are transformed and stored in .xls format for further processing. All data were normalized to 
have a mean of zero and a unit standard deviation. The dataset contains 111 characteristics; exploratory analysis removed the 
78 characteristics dominated by missing values and retained the 33 important ones. This exploratory analysis of the data also 
allowed us to identify two categories of characteristics, namely virus-related characteristics and blood-related 
characteristics. The target value is divided into two categories: negative cases, coded as 0, and positive cases, 
coded as 1. 
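
As an illustration of the two operations described above (discarding characteristics that are mostly missing, then rescaling to zero mean and unit standard deviation), the following is a minimal sketch using only the Python standard library. The 50 % missing-value threshold, the function names, and the toy rows are assumptions made for the example; the paper does not state the criterion actually used.

```python
import math


def drop_sparse_columns(rows, max_missing_ratio=0.5):
    """Keep only columns whose share of missing (None) values is at most
    max_missing_ratio. The 0.5 threshold is an assumption, not the paper's."""
    n_rows = len(rows)
    kept = [
        column for column in rows[0]
        if sum(row[column] is None for row in rows) / n_rows <= max_missing_ratio
    ]
    return [{column: row[column] for column in kept} for row in rows]


def standardize(values):
    """Rescale numbers to zero mean and unit (population) standard deviation.
    Assumes the values are not all identical (non-zero standard deviation)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


# Toy rows with made-up values; "hco3" is missing in two of three rows.
rows = [
    {"leukocytes": 4.0, "hco3": None, "target": 0},
    {"leukocytes": 6.0, "hco3": None, "target": 1},
    {"leukocytes": 8.0, "hco3": 24.0, "target": 0},
]
cleaned = drop_sparse_columns(rows)
z = standardize([row["leukocytes"] for row in cleaned])
```

In practice, the scaling statistics would normally be computed on the training portion only and then reused on the test portion, so that no information from the test data leaks into training.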

The dataset from the Israelita Albert Einstein Hospital in Sao Paulo is divided into training and test data: 70 % 
of the data is used for predictive model training, and the remaining 30 % is used for testing. The objective of model 
training is to fit the model using data from the training set. After the model is trained, the prediction models are 
tested to evaluate their performance on the test dataset. 
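
The 70/30 partition described above can be sketched as follows. This is a hedged illustration, not the authors' code: the function name `split_train_test`, the fixed seed, and the use of a plain (non-stratified) shuffle are assumptions.

```python
import random


def split_train_test(samples, test_ratio=0.3, seed=0):
    """Shuffle a copy of the samples and split it into (train, test) parts.
    The seed is fixed only to make this illustration reproducible."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


samples = list(range(100))          # stand-in for 100 patient records
train, test = split_train_test(samples)
```

With strongly imbalanced classes, as in this dataset, a stratified split that preserves the share of positive cases in both parts is often preferred; the sketch above does not do that.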

2.3. The Proposed Models 

This section describes the different machine learning models used in this paper. These models are: Random 
Forest (RF), K-Nearest Neighbors (KNN), Linear Support Vector Machine (SVM), Logistic Regression (LR), 
Decision Tree (DT), and AdaBoost (AB). 

Random Forest (RF) 

Random forests (RF), or random decision forests, were first proposed in 1995. This is a general classification 
training method that tends to work better than traditional decision tree classification methods (Gangaie et al., 2019). 
Decision trees are the base classifiers of an RF: each tree votes on the prediction, and the final prediction is made 
by majority vote over the trees (Breiman, 2001). The accuracy of each tree and the independence of the 
trees from each other determine the reliability of the classification. We used 100 trees to predict the two target classes, 
negative or positive SARS-CoV-2 test result. 
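
The majority-voting step can be illustrated in a few lines. This sketch only shows how individual tree outputs are combined; it does not train any trees, and the five votes are made up for the example.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical votes of five trees for one patient (1 = positive, 0 = negative).
votes = [1, 0, 1, 1, 0]
prediction = majority_vote(votes)
```

Using an odd number of trees avoids ties in a two-class problem; with an even number, such as 100, a tie-breaking rule is needed (here, `Counter.most_common` keeps the label that was counted first).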

K-Nearest Neighbor (KNN) 

The K-Nearest Neighbor (KNN) classifier is one of the most commonly used classification algorithms. This 
algorithm can be used in several applications. It stores all available instances and classifies new instances according to a 
similarity measure. KNN is a statistical pattern recognition method for detecting the different classes of a model. A 
tree data structure is used to determine the distance between the point of interest and the points in the training dataset. 
An instance is classified by its neighbors. In the classification method, the value of k is always a positive integer, 
the number of nearest neighbors considered. The nearest neighbors are selected from a set of objects whose classes or property values are known. 
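
A minimal sketch of the KNN decision rule follows. It is an assumption-laden illustration: it uses Euclidean distance and a brute-force search over all training points rather than the tree data structure mentioned above, and the points, labels, and k = 3 are invented for the example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points
    (Euclidean distance, brute-force search)."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy two-feature data: class 0 clusters near (0, 0), class 1 near (1, 1).
train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
train_labels = [0, 0, 0, 1, 1]
label_near_origin = knn_predict(train_points, train_labels, (0.15, 0.15))
label_near_one = knn_predict(train_points, train_labels, (0.95, 1.0))
```

Because KNN relies on distances, it benefits from the zero-mean, unit-variance normalization described in Section 2.2; otherwise features with large numeric ranges dominate the distance.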

Support Vector Machine (SVM) 

The SVM is a supervised learning method used for classification and regression [29]. This algorithm is a relatively 
new approach and has performed well in recent years. The SVM classifier is based on linear classifiers: when the data 
can be separated by a line, the SVM separates the objects into the specified classes. It can also classify instances that 
were not seen in the training data. One extension of this algorithm performs regression analysis to obtain a linear 
function, and another extension learns to rank the elements to obtain a classification of individual elements. 
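
To make the idea of a linear separator concrete, the sketch below shows the decision rule of a linear SVM and the hinge loss that its training typically minimizes. The weights and bias are hypothetical values chosen for illustration, not parameters fitted to the hospital dataset, and labels are encoded as -1/+1 as is conventional for this formulation.

```python
def decision_score(weights, bias, x):
    """Signed score w.x + b; its sign tells on which side of the hyperplane x lies."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def linear_svm_predict(weights, bias, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return 1 if decision_score(weights, bias, x) >= 0 else -1


def hinge_loss(weights, bias, x, y):
    """Hinge loss max(0, 1 - y (w.x + b)) for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * decision_score(weights, bias, x))


# Hypothetical parameters, not fitted to the paper's data.
weights, bias = [1.0, -1.0], 0.0
```

A point classified correctly with a score of at least 1 in magnitude incurs zero loss; points inside the margin or on the wrong side are penalized linearly.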

Logistic Regression Model (LR) 

Logistic regression is the appropriate regression analysis to perform when the dependent 
variable is dichotomous (binary). Like all regression analyses, logistic regression is a predictive analysis. It is used to 
describe the data and explain the relationship between a dependent binary variable and one or more nominal, ordinal, 
interval, or ratio-level independent variables [30, 31]. This approach assumes that the binary result follows a 
binomial distribution. 
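
The prediction step of logistic regression can be sketched as follows: a linear combination of the features is passed through the logistic (sigmoid) function to give a probability of class 1, which is then thresholded. The coefficients used in the test values and the 0.5 threshold are assumptions for illustration only.

```python
import math


def predict_proba(weights, bias, x):
    """Probability of class 1: sigmoid of the linear score w.x + b."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(weights, bias, x, threshold=0.5):
    """Return 1 (positive) if the predicted probability reaches the threshold, else 0."""
    return 1 if predict_proba(weights, bias, x) >= threshold else 0
```

Lowering the threshold below 0.5 trades precision for recall, which can be relevant when missing a positive case is costlier than a false alarm.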

Decision Tree (DT) Model 

The Decision Tree is a supervised learning method that can be used to solve both classification and regression problems, 
but it is more often used for classification. It is a powerful classification method for disease prediction. It is a tree 
model in which the internal nodes represent the characteristics of a data set, the branches represent the decision rules, and 
each leaf node represents a result. The decision tree consists of two types of nodes, decision nodes and leaf nodes. Decision 
nodes have multiple branches and are used to make a decision, while leaf nodes are the result of those decisions. 
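
The two node types can be represented directly as data structures. The sketch below builds a small hand-written tree and walks it from the root to a leaf; the feature names are taken from the dataset description, but the thresholds and the tree shape are hypothetical and have no clinical meaning.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: int                      # final result: 0 = negative, 1 = positive


@dataclass
class DecisionNode:
    feature: str                    # characteristic tested at this node
    threshold: float                # decision rule: value <= threshold goes "below"
    below: Union["DecisionNode", Leaf]
    above: Union["DecisionNode", Leaf]


def classify(node, sample):
    """Follow decision rules from the root until a leaf is reached."""
    while isinstance(node, DecisionNode):
        node = node.below if sample[node.feature] <= node.threshold else node.above
    return node.label


# Hypothetical tree on standardized features (illustrative thresholds only).
tree = DecisionNode(
    "leukocytes", -0.5,
    Leaf(1),
    DecisionNode("monocytes", 0.3, Leaf(0), Leaf(1)),
)
```

A learned tree would choose each feature and threshold by optimizing a purity criterion on the training data rather than by hand.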

AdaBoost (AB) Model 

AdaBoost, short for “Adaptive Boosting”, is the first practical boosting algorithm, proposed by Freund and Schapire in 
1996. Its goal is to turn weak predictors into strong predictors to solve classification problems. For classification, the 
final equation can be written as follows: 

$$F(x) = \operatorname{sign}\left(\sum_{m=1}^{M} \theta_m f_m(x)\right) \qquad (1)$$

where $f_m$ denotes the m-th weak classifier, $\theta_m$ denotes the corresponding weight, and $M$ is the number of weak classifiers. AdaBoost can be used for face 
recognition, as it is a standard algorithm for detecting faces in images. AdaBoost is fast, requires no setup, and is simple 
and easy to program. In addition, it is flexible enough to be combined with any machine learning algorithm. 
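
The weighted vote of Equation (1) can be sketched as below. The three decision stumps and their weights are hypothetical; the helper `classifier_weight` uses the standard binary AdaBoost weighting, half the log-odds of a weak classifier's accuracy, which is the usual formulation but is not stated in this paper.

```python
import math


def classifier_weight(error):
    """Standard AdaBoost weight 0.5 * ln((1 - error) / error) for a weak
    classifier with weighted error rate `error`, where 0 < error < 1."""
    return 0.5 * math.log((1.0 - error) / error)


def adaboost_predict(weak_classifiers, weights, x):
    """Equation (1): sign of the weighted sum of weak classifier outputs in {-1, +1}."""
    total = sum(theta * f(x) for f, theta in zip(weak_classifiers, weights))
    return 1 if total >= 0 else -1


# Three hypothetical decision stumps on a two-feature input.
stumps = [
    lambda x: 1 if x[0] > 0.0 else -1,
    lambda x: 1 if x[1] > 1.0 else -1,
    lambda x: 1 if x[0] + x[1] > 0.5 else -1,
]
thetas = [0.4, 0.7, 0.2]            # illustrative weights, not learned values
```

A weak classifier no better than chance (error 0.5) receives weight zero, so it does not influence the vote, while more accurate classifiers receive larger weights.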

2.4. Evaluation of Performance Measures 

To compare the different classification algorithms used in this paper, several metrics were evaluated. 
These are accuracy, recall, and F1-score. These metrics are calculated from the true positives (TP), true negatives (TN), 
false positives (FP), and false negatives (FN). The normalized confusion matrix illustrates the relationship between 
the true classes and the predicted classes. The classification performance is calculated from the number of 
samples correctly and incorrectly classified in each class. 


The accuracy is the share of correct predictions among all predictions, defined as follows: 

$$\text{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP} \qquad (2)$$

Recall, or sensitivity, is the proportion of actual positive cases that have been correctly identified, defined as 
follows: 

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (3)$$

The F1 score is the harmonic mean of precision, $TP / (TP + FP)$, and recall, and it is calculated by: 

$$F_1 = \frac{TP}{TP + \frac{1}{2}(FP + FN)} \qquad (4)$$
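
As a worked illustration of Equations (2) to (4), the sketch below counts TP, TN, FP, and FN from a pair of label lists and evaluates the three metrics. The ten labels are made-up toy values, not results from the paper.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count (TP, TN, FP, FN) with respect to the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + fn + tn + fp)      # Equation (2)


def recall(tp, fn):
    return tp / (tp + fn)                       # Equation (3)


def f1_score(tp, fp, fn):
    return tp / (tp + 0.5 * (fp + fn))          # Equation (4)


# Toy example: 4 actual positives and 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
```

For these toy labels, TP = 3, TN = 4, FP = 2, and FN = 1, giving an accuracy of 7/10, a recall of 3/4, and an F1 score of 3/4.5 = 2/3. With imbalanced classes, accuracy can look high even when few positives are found, which is why recall and F1 for class 1 are the more informative measures here.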

3. Results 


The objective of this paper is to compare different machine learning models for the detection of 
coronavirus. Our task was to find out which machine learning model has the best recall and F1-score for class 1. The 
machine learning models used are: random forest, k-nearest neighbor, logistic regression, support vector machine, 
AdaBoost, and decision tree. Out of a total of 5,644 people tested for COVID-19, 5,086 people tested negative and 558 
people tested positive. The results of our study are presented in Figure 1 and Figure 3. These results show that the 
support vector machine gave the best results, with a recall of 75 % and an F1 score of 60 %. The learning curves of the different models were 
also plotted in order to examine over-fitting and under-fitting (Figure 2). The learning curve, 
which is well known to data scientists, shows the efficiency and quality of learning of a machine 
learning model. Learning curves are widely used as a diagnostic tool in machine learning for algorithms that 
learn incrementally from a training dataset. This means that we increase the size of the training dataset by a certain step and then measure the 
performance of the model. The model can be evaluated on the training dataset and on a hold-out validation dataset 
after each update during training, and the measured performance is recorded. This can be represented as a curve. 
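
The procedure for producing a learning curve can be sketched as follows. This is an assumption-heavy illustration: a deliberately simple one-feature nearest-centroid classifier stands in for the models compared in the paper, the data are synthetic, and the training sizes (50 to 350 in steps of 50) are chosen only to mirror the tick marks visible in Figure 2.

```python
import random


def fit_centroids(xs, ys):
    """Return the mean feature value of each class (a stand-in 1-D model)."""
    centroids = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = sum(members) / len(members)
    return centroids


def score(centroids, xs, ys):
    """Accuracy when each x is assigned to the class with the closest centroid."""
    hits = sum(
        min(centroids, key=lambda c: abs(x - centroids[c])) == y
        for x, y in zip(xs, ys)
    )
    return hits / len(xs)


def learning_curve(train_x, train_y, val_x, val_y, sizes):
    """For each training size n, fit on the first n samples and record
    (n, train score, validation score)."""
    curve = []
    for n in sizes:
        model = fit_centroids(train_x[:n], train_y[:n])
        curve.append((n, score(model, train_x[:n], train_y[:n]),
                      score(model, val_x, val_y)))
    return curve


rng = random.Random(0)


def make_data(n):
    """Synthetic data: alternating labels, feature drawn around 0 or 2."""
    ys = [i % 2 for i in range(n)]
    xs = [rng.gauss(2.0 * y, 1.0) for y in ys]
    return xs, ys


train_x, train_y = make_data(350)
val_x, val_y = make_data(100)
curve = learning_curve(train_x, train_y, val_x, val_y, sizes=range(50, 351, 50))
```

A large, persistent gap between the training and validation scores suggests over-fitting, whereas two low scores that stay close together suggest under-fitting.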

[Figure 1: per-model classification reports with f1-score and support columns; the numeric values are not recoverable from the extracted text.] 
Fig. 1. Classification report of different machine learning models 

Information technology, computer science, and management 




Advanced Engineering Research 2022. V. 22, no. 1. P. 67—75. ISSN 2687-1653 

















[Figure 2 panels: RandomForest, AdaBoost, SVM, KNN, DecisionTree, Logistic_Regression. Each panel plots the train score and the validation score; x-axis ticks run from 50 to 350 and y-axis scores from about 0.2 to 1.0.] 
Fig. 2. Learning curve of different machine learning models 

[Figure 3: bar chart comparing KNN, LR, LSVM, RF, AB, and DT on four performance measures: Precision, Recall, F1-Score, and Accuracy.] 
Fig. 3. Results of predictions from various machine learning techniques 

Figure 3 shows the performance of the different machine learning algorithms according to the performance 
measures used in this paper. We see that for recall and F1-score, LSVM outperforms the other machine learning models 
used, namely LR, KNN, RF, AB, and DT. For precision, LR is much better than the others. As for accuracy, we find 
that LR and AB performed better than the other models. In this paper, we chose recall and F1 score to measure the 
performance of the model. Recall measures how many of the truly COVID-19 positive subjects were correctly 
identified. As for the F1 score, we used it because the classes, i.e., positive and 
negative cases, are imbalanced. 

4. Discussion and Conclusion 

The data used in this paper were collected at the Israelita Albert Einstein Hospital in Sao Paulo, Brazil. After an 
exploratory analysis, two categories of characteristics were identified: characteristics related to the virus and 
characteristics related to the blood. Out of a total of 5,644 people tested for COVID-19, 5,086 people tested negative 
and 558 people tested positive. The results of this study clearly illustrate that, in relation to our goal, the support vector 
machine showed the best results in coronavirus detection, with a recall of 75 % and an F1 score of 60 %. This comparison 
was made with the other machine learning models, namely the random forest, the k-nearest neighbor, the logistic regression, 
the AdaBoost, and the decision tree. As such, this model can be useful for the diagnosis of COVID-19. However, it is 
possible to optimize the parameters of this model in order to improve its performance. 

After the analysis of the learning curves in Figure 2, we find that, apart from the support vector machine, other 
machine learning models can be studied for the detection of COVID-19. These include AdaBoost and k-nearest 
neighbor. Indeed, we find that with somewhat more advanced optimization of the parameters of these models, they 
could be candidates for the diagnosis of COVID-19, because narrowing the gap between the training score curve and the 
validation score curve would improve the model's ability to generalize. 